#! This lab is worth 11 pts and is due at 5pm on Friday, 11/6 !#
In this lab, we will use species occurrence data and environmental raster layers to construct ecological niche models (ENMs) for several species. Goals– 1: To explore two methods for constructing ecological niche models (ENMs) and how to interpret them. 2: To project models onto the landscape to evaluate the relationship between environmental space and geographic space.
Load required packages…
There are many methods for constructing ENMs. The simplest is to use a generalized linear model (GLM), a form of multiple regression, which we’ve seen before. So, we’ll start this lab by constructing a GLM ENM for the California red-legged frog (CRLF), Rana draytonii, the state amphibian of California.
First, we need to load the species occurrence points…
CRLF.occ <- read.csv("Rana_draytonii.csv")
head(CRLF.occ)
## longitude latitude
## 1 -121.8405 36.92584
## 2 -121.5008 36.34028
## 3 -121.5932 37.64074
## 4 -121.5549 36.38550
## 5 -121.7865 36.83307
## 6 -122.2518 37.90269
Let’s also read in a shapefile for the county borders in California…
And now let’s plot the occurrence points…
plot(counties, border="gray50")
points(CRLF.occ, pch=21, bg="blue")
These are the presence points for the CRLF (i.e. the places where the frogs are found). We can build a model with just these points, but it can also be valuable to know where they don’t occur (i.e. absence points). It’s always difficult to demonstrate that something doesn’t occur somewhere, so usually we don’t have good absence information. That means that we use ‘pseudo-absence’ points instead - they stand in for true absence points (a rough approximation).
We can generate a set of pseudo-absence points by randomly choosing points in California using the spsample() function.
spsample(x, n, type, …) Arguments x: Spatial object; spsample(x,…) is a generic method for the existing sample.Xxx functions n: sample size type: character; “random” for completely spatial random; “regular” for regular (systematically aligned) sampling; “stratified” for stratified random (one single random location in each “cell”); “hexagonal” for sampling on a hexagonal lattice; “clustered” for clustered sampling.
We want to generate the same number of pseudo-absence points as we have presence points, so we’ll set n to the number of points in the CRLF dataset.
set.seed(123)
CRLF.abs <- spsample(counties, n = nrow(CRLF.occ), type = "random")
## Warning in proj4string(obj): CRS object has comment, which is lost in output
Then plot the presence and absence points…
plot(counties, border = "gray50")
points(CRLF.occ, pch = 21, bg = "blue")
points(CRLF.abs, pch = 21, bg = "red")
The second set of data we need for constructing an ENM is the environmental data. So, we also need to read in a set of bioclimate rasters from the WorldClim database…
setwd("./present")
wclim <- stack(list.files())
plot(wclim)
In this case, we will only use six of the available data layers today.
bio1: Annual Mean Temperature bio2: Mean Diurnal Temperature Range bio7: Annual Temperature Range bio12: Annual Precipitation bio15: Precipitation Seasonality bio16: Precipitation of Wettest Quarter
As you can see, these WorldClim rasters cover more than just California, so we’ll use a mask to clip them to the state boundary.
wclim <- mask(wclim, counties)
plot(wclim)
And now we can extract the values for our presence and absence points from the rasters…
# Extract the presence points
env.pres <- extract(wclim, CRLF.occ)
# Extract the absence points
env.abs <- extract(wclim, CRLF.abs)
And then we need to make them into a dataframe for the next part of the analysis. In this step, we’ll also add a column that says whether the point is a presence (occ = 1) or absence (occ = 0) point.
env.pres <- data.frame(occ = 1, env.pres)
env.abs <- data.frame(occ = 0, env.abs)
# Combine the two dataframes
env <- rbind(env.pres, env.abs)
env[c(1, 2, 3, 101, 102, 103),]
## occ bio1 bio12 bio15 bio16 bio2 bio7
## 1 1 138 616 90 337 124 206
## 2 1 123 666 88 353 154 262
## 3 1 150 476 84 244 145 304
## 101 0 103 1752 77 897 133 263
## 102 0 172 174 80 94 162 347
## 103 0 163 529 90 294 157 333
We should check whether any of the predictor variables (the WorldClim variables) are collinear (i.e. correlated). We can do that using the pairs() function, which will plot all of the extracted values for each pair of variables in our dataset.
pairs(env[,-1])
Do any of the variables look like they’re correlated? Which ones? > Mean Diurnal Temperature Range (bio2) and Annual Temperature Range (bio7) are correlated, and Annual Precipitation (bio12) and Precipitation of Wettest Quarter (bio16) are correlated.
The next step in constructing an ENM is model fitting. For this step, we will describe the formula for the relationships among the variables in our dataset. In R, the tilde (~) stands in for the equal sign (=) in a mathematical formula. Here, we’ll set the occurrence (presence or absence) as the response variable and the bioclimate variables as the predictors. When we know what formula we want, we then use the glm() function to construct a generalized linear model.
glm(formula, data, …) Arguments formula: an object of class “formula”: a symbolic description of the model to be fitted. data: the data to be used to fit the model
enm1 <- glm(occ ~ bio1 + bio2 + bio7 + bio12 + bio15 + bio16, data = env)
summary(enm1)
##
## Call:
## glm(formula = occ ~ bio1 + bio2 + bio7 + bio12 + bio15 + bio16,
## data = env)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.94601 -0.23076 0.04009 0.21266 0.79094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1009999 0.4545780 2.422 0.01659 *
## bio1 -0.0001058 0.0012171 -0.087 0.93082
## bio2 0.0087814 0.0031222 2.813 0.00555 **
## bio7 -0.0069833 0.0013883 -5.030 1.34e-06 ***
## bio12 -0.0014936 0.0015171 -0.985 0.32640
## bio15 0.0053781 0.0042779 1.257 0.21059
## bio16 0.0022131 0.0029607 0.747 0.45591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1347119)
##
## Null deviance: 40.50 on 161 degrees of freedom
## Residual deviance: 20.88 on 155 degrees of freedom
## AIC: 143.83
##
## Number of Fisher Scoring iterations: 2
Based on the output of this model, which predictor variables are significant? > Based on the output, Mean Diurnal Temperature Range (bio2) and Annual Temperature Range (bio7) are significant.
We can also fit a different model using a subset of the predictor variables…
enm2 <- glm(occ ~ bio1 + bio2 + bio7 + bio12 + bio15, data = env)
summary(enm2)
##
## Call:
## glm(formula = occ ~ bio1 + bio2 + bio7 + bio12 + bio15, data = env)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.97198 -0.23599 0.04073 0.20891 0.76385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9012415 0.3672052 2.454 0.015214 *
## bio1 0.0003921 0.0010171 0.386 0.700356
## bio2 0.0091560 0.0030774 2.975 0.003394 **
## bio7 -0.0072187 0.0013502 -5.346 3.14e-07 ***
## bio12 -0.0003624 0.0001060 -3.418 0.000805 ***
## bio15 0.0072903 0.0034239 2.129 0.034801 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1343308)
##
## Null deviance: 40.500 on 161 degrees of freedom
## Residual deviance: 20.956 on 156 degrees of freedom
## AIC: 142.42
##
## Number of Fisher Scoring iterations: 2
Which is the better model? How do you know this is the better model, and what do you think makes it the better model? What is the relationship between bio12 and bio16? >The second model is better because the AIC score is lower. Variables bio12 and bio16 very postively correlated (they almost dictate each other completely)
Look back at the pairs() output. Is there another pair of variables that appear to be correlated? Construct a new ENM using glm() that removes one of them, and examine the results of the model.
enm3 <- glm(occ ~ bio1 + bio2 + bio7 + bio12, data = env)
summary(enm3)
##
## Call:
## glm(formula = occ ~ bio1 + bio2 + bio7 + bio12, data = env)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.96258 -0.25532 -0.01461 0.21818 0.81711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4215604 0.2771577 5.129 8.48e-07 ***
## bio1 0.0010126 0.0009854 1.028 0.30573
## bio2 0.0129025 0.0025530 5.054 1.19e-06 ***
## bio7 -0.0093185 0.0009325 -9.993 < 2e-16 ***
## bio12 -0.0002921 0.0001019 -2.867 0.00471 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1373544)
##
## Null deviance: 40.500 on 161 degrees of freedom
## Residual deviance: 21.565 on 157 degrees of freedom
## AIC: 145.06
##
## Number of Fisher Scoring iterations: 2
Which is the best model of the three? Assign that model to the CRLF.enm objected below.
CRLF.enm <- enm3
Now that we have fit several models and selected the best one, we can move on to the next stage: model evaluation. R makes this relatively easy using the evaluate() function.
evaluate(p, a, model, x, …) Arguments p: presence points (x and y coordinates or SpatialPoints* object). a: absence points (x and y coordinates or SpatialPoints* object). model: any fitted model x: Optional. Predictor variables (object of class Raster*). If present, p and a are interpreted as (spatial) points
eval <- evaluate(p = CRLF.occ, a = CRLF.abs, model = CRLF.enm, x = wclim)
eval
## class : ModelEvaluation
## n presences : 81
## n absences : 81
## AUC : 0.8968907
## cor : 0.6837689
## max TPR+TNR at : 0.3854316
Now that we have created a ModelEvaluation object, we can check whether the ENM predicts presence points (red) as having a higher value than absence points (blue). i.e. Presence points should occur in environmental conditions with higher suitability. Let’s first look at a density plot of the results…
density(eval)
We can also view a box plot of the results…
boxplot(eval, col = c("blue", "red"))
Finally, we can graph the receiver operating characterstic (ROC) curve for the results, a graph of how often the ENM models the presence of the organism correctly. The red circles and lines represent the values from our model, and the gray line is the values expected under a null model.
plot(eval, 'ROC')
Examine the axis labels for the ROC plot. Does the model we constructed appear to be a better than the null model? Explain why. i.e. What specifically does the ROC curve show us (don’t just write that it’s above the line or reiterate the description from above)? > The model we constructed appears to be better than the null model. It is showing us that the ENM models the presence of the organism with True Postives more often than not.
It can also be helpful to visualize which part of environmental niche space the species occupies. To do that, we can plot the values for two of the variables (here, we’ll use annual mean temperature and annual precipitation) for the California red-legged frog occurrence points (in blue) compared to the random background points that represent the total range of available environmental conditions in California (in red). It’s important to note, however, that this is only two of the variables we have included in our model, and environmental niche space is actually much more complex than this simplification. Nevertheless, it gives us a general sense of the thermal niche and precipitation niche for this species.
plot(env.abs$bio1, env.abs$bio12, main = "Environmental Space", xlab = "Annual Mean Temperature (Bio1)", ylab = "Annual Precipitation (Bio12)", col = "red", pch = 20)
points(env.pres$bio1, env.pres$bio12, col = "blue", pch = 20)
plot(convexHull(cbind(env.abs$bio1[-9], env.abs$bio12[-9])), border = "red", add = TRUE)
plot(convexHull(cbind(env.pres$bio1, env.pres$bio12)), border = "blue", add = TRUE)
Finally, now that we have selected an optimal model and evaluated its performance, we can use it for prediction. In the climate change lab, we used a model to predict temperature at other points in time, but here we’ll use the model to predict whether CRLFs occur at other points in space (i.e. places where we don’t have observations in our dataset). We can do this using the predict() function.
predict(object, x, …) Arguments object: a fitted model x: a Raster* object or a data.frame
enm.pred <- predict(wclim, CRLF.enm)
enm.pred
## class : RasterLayer
## dimensions : 261, 269, 70209 (nrow, ncol, ncell)
## resolution : 0.04166667, 0.04166667 (x, y)
## extent : -125, -113.7917, 32.25, 43.125 (xmin, xmax, ymin, ymax)
## crs : NA
## source : memory
## names : layer
## values : -0.2429289, 1.369588 (min, max)
You can see above that when we apply the predict() function, it generates a new RasterLayer. That raster contains the predictions for where the species occurs - the value of each cell is the probability of occurrence. These values should range from 0 to 1, but the GLM will actually cause some values to exceed this range. So, we need to replace these values with 0s (if they’re < 0) and 1s (if they’re > 1).
enm.pred[enm.pred < 0] <- 0
enm.pred[enm.pred > 1] <- 1
And then we can plot the results…
# Plot the results
sdmScheme <- colorRampPalette(c("blue", "green", "yellow", "orange", "red"), space = "rgb", interpolate = "linear")
plot(enm.pred, col = sdmScheme(100), main = "Occurrence Probability for CRLF")
points(CRLF.occ, col = "blue")
points(CRLF.abs, col = "red")
We generally interpret the occurrence probabilities as “habitat suitability” values, because more suitable environmental conditions should mean the species is more likely to be found there.
Generally speaking, how are the occurrence probabilites spatially distributed? Which regions have the most suitable habitat for California red-legged frogs? >The occurance probabillities are distributed along the entire coast of California, with the highest probability along the central coast (most suitable habitat).
Let’s also create a GLM ENM for another species, the California grizzly (go bears!), the official state animal of California, which unfortunately was hunted to extinction in 1922.
Load occurrence points for the California grizzly…
grizzly.occ <- read.csv("Ursus_arctos_californicus.csv")
And select pseudo-absence points…
set.seed(234)
grizzly.abs <- spsample(counties, n = nrow(grizzly.occ), type = "random")
## Warning in proj4string(obj): CRS object has comment, which is lost in output
And plot the points…
plot(counties, border = "gray50")
points(grizzly.occ, pch = 21, bg = "blue")
points(grizzly.abs, pch = 21, bg = "red")
** Extract the values for the presence and absence points, make them into a dataframe, and check for correlations between the predictor variables.**
# Extract the presence points
bear.pres <- extract(wclim, grizzly.occ)
# Extract the absence points
bear.abs <- extract(wclim, grizzly.abs)
bear.pres <- data.frame(occ = 1, bear.pres)
bear.abs <- data.frame(occ = 0, bear.abs)
# Combine the two dataframes
bear <- rbind(bear.pres, bear.abs)
bear[c(1, 2, 3, 70, 71, 72),]
## occ bio1 bio12 bio15 bio16 bio2 bio7
## 1 1 141 484 86 265 139 309
## 2 1 163 480 96 278 118 215
## 3 1 152 528 89 297 140 297
## 70 0 138 549 74 273 161 324
## 71 0 147 337 81 175 185 361
## 72 0 161 228 51 97 162 364
Now, fit some GLMs using three sets of the predictor variables and examine the results.
pairs(bear[,-1])
gnm1 <- glm(occ ~ bio1 + bio2 + bio7 + bio12 + bio15 + bio16, data = bear)
gnm2 <- glm(occ ~ bio1 + bio2 + bio7 + bio12 + bio15, data = bear)
gnm3 <- glm(occ ~ bio2 + bio7 + bio12 + bio15, data = bear)
summary(gnm3)
##
## Call:
## glm(formula = occ ~ bio2 + bio7 + bio12 + bio15, data = bear)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8104 -0.3764 0.1091 0.3183 0.9074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0475954 0.6290005 0.076 0.93989
## bio2 -0.0125959 0.0046361 -2.717 0.00822 **
## bio7 0.0035281 0.0022732 1.552 0.12498
## bio12 0.0004412 0.0001453 3.036 0.00332 **
## bio15 0.0119617 0.0050194 2.383 0.01977 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1896898)
##
## Null deviance: 19.500 on 77 degrees of freedom
## Residual deviance: 13.847 on 73 degrees of freedom
## AIC: 98.522
##
## Number of Fisher Scoring iterations: 2
Which is the optimal model? Assign it to the grizzly.enm object below, and then generate the raster of occurrence probabilies.
grizzly.enm <- gnm3
grizzly.pred <- predict(wclim, grizzly.enm)
And then we can plot the results…
grizzly.pred[grizzly.pred < 0] <- 0
grizzly.pred[grizzly.pred > 1] <- 1
# Plot the results
sdmScheme <- colorRampPalette(c("blue", "green", "yellow", "orange", "red"), space = "rgb", interpolate = "linear")
plot(grizzly.pred, col = sdmScheme(100), main = "Occurrence Probability for the California Grizzly", zlim = c(0, 1))
points(grizzly.occ, col = "blue")
points(grizzly.abs, col = "red")
And then evaluate the model and check the ROC curve…
eval <- evaluate(p = grizzly.occ, a = grizzly.abs, model = grizzly.enm, x = wclim)
plot(eval, 'ROC')
So, the GLM ENM we made for the California grizzly is ok, but there’s not as big a difference between the presence and absence values as we’d like to see for a really good model. The GLM method is too simple for many species, so we should use a more sophisticated method instead. One that overcomes some of the limitations of the GLM approach is the Bioclim method. The biggest difference is the GLM looks for linear relationships between the predictor variables and the occurrence points, whereas the Bioclim methods tries to full a bell-curve shape to those relationships.
We can see what that looks like if we use the response() function to show us the relationships between the predictor variables and the probability of occurrence for the California red-legged frog GLM ENM.
response(CRLF.enm)
Now, for comparison, let’s fit a Bioclim ENM to the same CRLF data using the bioclim() function. However, the Bioclim method only uses presence points, so we’ll drop the pseudo-absence points from this analysis.
bioclim(x, p, …) Arugments x: Raster* object or matrix of predictor variables p: two column matrix or SpatialPoints* object of occurrence points
CRLF.bc.enm <- bioclim(wclim, CRLF.occ)
And then we can examine the relationship of each variable…
response(CRLF.bc.enm)
Now, fit a Bioclim ENM for the California grizzly…
grizzly.bc.enm <- bioclim(wclim, grizzly.occ)
grizzly.bc.pred <- predict(wclim, grizzly.bc.enm)
plot(grizzly.bc.pred, col = sdmScheme(100), zlim=c(0,1),
main = "Habitat Suitability for U. arctos californicus")
And check the ROC plot…
bc.eval <- evaluate(p = grizzly.occ, a = grizzly.abs, model = grizzly.bc.enm, x = wclim)
plot(bc.eval, 'ROC')
So, now we clearly have a better model for the California grizzly. That means we can also use our model to project into the future. We’ll first load a set of the same WorldClim variables but for the year 2070 instead of present day.
setwd("./future")
wclim2070 <- stack(list.files())
wclim2070 <- mask(wclim2070, counties)
And then we can use the predict() function to project where suitable habitat for the California grizzly would be in 2070.
grizzly.bc.2070 <- predict(wclim2070, grizzly.bc.enm)
# Plot the ENMs for 2019 and 2070 side by side
par(mfrow = c(1, 2))
plot(grizzly.bc.pred, col = sdmScheme(100), main = "Occurrence Probability (2019)", zlim = c(0, 1))
plot(grizzly.bc.2070, col = sdmScheme(100), main = "Occurrence Probability (2070)", zlim = c(0, 1))
Generally speaking, what happens to the distribution of potential habitat/environmental conditions for the California grizzly under climate change in 2070? Make sure you include the geographic context in your answer. > Under climate change, the distribution of potential habitat/environmental conditions for the California grizzly decreases/becomes narrower in 2070. This is shown by the reduced spatial distribution of higher values of occurence probability and increased spatial distribution of lower values of occurence probability on the 2070 map comapred to 2019.
Finally, let’s plot the Bioclim models for the CRLF and California grizzly side by side.
CRLF.bc.pred <- predict(wclim, CRLF.bc.enm)
par(mfrow = c(1, 2))
plot(CRLF.bc.pred, col = sdmScheme(100), main = "Occurrence Probability (CRLF)", zlim = c(0, 1))
plot(grizzly.bc.pred, col = sdmScheme(100), main = "Occurrence Probability (Grizzly)", zlim = c(0, 1))
It looks like they might have a few things in common, but it’s hard to say exactly just by looking at the two maps. If we want to quantify how similar their environmental niches are, we can calculate a metric of niche overlap. We can do that using the nicheOverlap() function.
nicheOverlap(x, y, stat=‘I’, …) Arguments x: RasterLayer with non-negative values (predictions of the probability that a site is suitable for a species) y: RasterLayer with non-negative values, as above stat: character either ‘I’ or ‘D’
D.obs <- nicheOverlap(CRLF.bc.pred, grizzly.bc.pred, stat="D")
D.obs
## [1] 0.3287221
Schoener’s D statistic ranges from 0 to 1, with 0 being no overlap and 1 indicated perfect overlap. So, this gives us a fairly low value of environmental niche overlap for the California grizzly and CRLF. However, this may just be because they have different ranges, and occupying different parts of geographic space may mean that you occupy different parts of environmental space just by chance. So, to test whether this moderately low level of niche divergence is less than expected by chance, we should construct a null distribution to compare it to.
First, we’ll draw a convex hull around the points for each species to give us their ranges.
CRLF.range <- convexHull(CRLF.occ)
grizzly.range <- convexHull(grizzly.occ)
plot(counties, border = "gray50")
plot(CRLF.range, border = "green4", lwd = 2, add = TRUE)
plot(grizzly.range, border = "brown", lwd = 2, add = TRUE)
Now, write a loop that will do the following steps 50 times: (1) generate random sets of 50 points from each of the convex hulls, (2) construct a Bioclim ENM for each set of points, and (3) calculate Schoener’s D for niche overlap for the two ENMs. Assign the results to a vector named D.null so that we can plot them later.
set.seed(789)
D.null <- c() # This vector will contain the comparisons of niche divergence between both geographic backgrounds
for (i in 1:50) {
CRLF.hull <- spsample(CRLF.range, n = 50, type = "random")
grizzly.hull <- spsample(grizzly.range, n = 50, type = "random")
CRLF.enm <- bioclim(wclim, CRLF.hull)
grizzly.enm <- bioclim(wclim, grizzly.hull)
CRLF.pred <- predict(wclim, CRLF.enm)
grizzly.pred <- predict(wclim, grizzly.enm)
D.n <- nicheOverlap(CRLF.pred, grizzly.pred, stat="D")
D.null <- c(D.null, D.n)
}
And plot the results as a histogram for the background values of Schoener’s D and a blue line for the observed value…
hist(D.null, main = "Niche Overlap vs Background", xlim = c(0, 1))
abline(v = D.obs, col = "blue", lwd = 2)
Based on this plot, does it look like the niche overlap between the California grizzly and California red-legged frog is less than or more than expected by chance? Explain why. (Think about the observed value relative to the null distribution.) > Based on this plot, it looks like the niche overlap between the California grizzly and California red-legged frog is less than expected by chance because there are definitely more factors (biotic and abiotic) that are influencing the niche environment and survival of both species in specific areas of California.